Title
Using a K-NN Classification Model to Predict the Genre of a Given Song Based on Danceability and Energy
Introduction
The question we want to answer with our project is: “What is the genre of a given song based on its danceability and energy values?” This is a classification question: it uses one or more variables to predict the value of a categorical variable of interest. We will use the K-nearest neighbours (K-NN) algorithm to predict the genre of our chosen songs. To classify a new observation, the algorithm calculates the distance between that observation and every point in the training data and selects the K training points closest to it. It then computes the proportion of those K neighbours that belong to each class and predicts the class with the highest proportion, which amounts to a majority vote. The dataset we will be using is “Dataset of songs in Spotify” from Kaggle. This dataset has 22 columns, including danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, type, id, uri, track_href, analysis_url, duration_ms, time_signature, genre, song_name and title. We will use danceability (from 0 to 0.99), energy (from 0 to 1) and genre, restricted to songs categorized as Emo, Hiphop and hardstyle, in our project.
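The distance-and-vote steps described above can be sketched in a few lines of base R. This is only an illustration: `knn_predict()` is a hypothetical helper written for this example, and the danceability and energy values are made up rather than taken from the Spotify dataset.

```r
# Minimal K-NN sketch in base R (illustrative only; toy data, not the Spotify dataset).
# knn_predict() is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, new_point, k = 3) {
  # Euclidean distance from the new observation to every training point
  diffs <- sweep(as.matrix(train_x), 2, new_point)
  distances <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(distances)[seq_len(k)]
  # Count how many of the k neighbours fall in each class
  votes <- table(train_y[nearest])
  # Predict the most common class (ties go to the first class in table order)
  list(class = names(votes)[which.max(votes)], proportions = votes / k)
}

# Made-up (danceability, energy) values chosen only to illustrate the steps
train_x <- data.frame(
  danceability = c(0.70, 0.75, 0.68, 0.45, 0.50, 0.48),
  energy       = c(0.60, 0.65, 0.62, 0.90, 0.92, 0.88)
)
train_y <- c("Hiphop", "Hiphop", "Hiphop", "hardstyle", "hardstyle", "hardstyle")

result <- knn_predict(train_x, train_y, new_point = c(0.72, 0.63), k = 3)
result$class
```

In the project itself we rely on tidymodels rather than a hand-written function; the sketch only shows what the algorithm does for a single new song.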
Preliminary exploratory data analysis
library(readr)
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)
# Read the Spotify genres dataset directly from GitHub
urlfile <- "https://raw.githubusercontent.com/brandonzchen/GroupProjDSCI/main/genres_v2.csv"
mydata <- read_csv(urlfile)
Rows: 42305 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): type, id, uri, track_href, analysis_url, genre, song_name, title
dbl (14): danceability, energy, key, loudness, mode, speechiness, acousticne...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Summarise the song count and the mean energy and danceability for each selected genre
genres <- c("hardstyle", "Emo", "Hiphop")
datainformation <- mydata |>
  select(danceability, energy, genre) |>
  filter(genre %in% genres) |>
  group_by(genre) |>
  summarise(count = n(), mean_energy = mean(energy), mean_danceability = mean(danceability))
datainformation
| genre | count | mean_energy | mean_danceability |
|---|---|---|---|
| <chr> | <int> | <dbl> | <dbl> |
| Emo | 1680 | 0.7611750 | 0.4936988 |
| Hiphop | 3028 | 0.6544179 | 0.6989818 |
| hardstyle | 2936 | 0.8962384 | 0.4780270 |
# Keep only the two predictors and the response for the three selected genres
mydata <- mydata |>
  select(danceability, energy, genre) |>
  filter(genre %in% genres)

# Scatterplot of danceability against energy, coloured by genre
genre_plot <- mydata |>
  ggplot(aes(x = energy, y = danceability)) +
  geom_point(alpha = 0.4, aes(colour = genre)) +
  ggtitle("Scatterplot of the Genres: Emo, hardstyle and Hiphop") +
  xlab("Energy") +
  ylab("Danceability") +
  labs(colour = "Genre") +
  theme(text = element_text(size = 18))
options(repr.plot.width = 10, repr.plot.height = 8)
genre_plot
Methods
Using the “Dataset of songs in Spotify” dataset, we will conduct a K-NN classification to predict the genre of specific songs within the dataset, with “danceability” and “energy” as the predictor variables and “genre” as the response variable. We will first reduce the dataset to these three variables, tidy the data, and keep only songs from the three genres Emo, hardstyle and Hiphop. We will then set aside the specific observations whose genre our classifier will predict. Next, we will build, tune, and evaluate our K-NN classification model: we will divide the remaining data into a training set and a testing set, use cross-validation on the training set to choose K, and evaluate the model with the chosen K on the testing set. Finally, we will use this model to predict the genre of the songs we initially set aside and visualize the results in a scatterplot with energy on the x-axis, danceability on the y-axis, a colour for each of the three genres, and a separate colour marking the observations we are predicting.
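The split, tune, and evaluate steps above might look roughly like the following tidymodels sketch. Several choices here are assumptions rather than decisions we have made: the seed, the 75/25 split, the 5 folds, and the grid of K values are placeholders, and the names `songs`, `songs_split`, `k_grid`, and so on are introduced only for this example. The block also simulates a small stand-in dataset if `mydata` is not already loaded, so that it can run on its own; the step of setting aside the songs to predict is not shown.

```r
library(tidyverse)
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2023)  # arbitrary seed, assumed here for reproducibility

# Stand-in data (simulated, NOT real songs) used only if the filtered `mydata` is absent
if (!exists("mydata")) {
  mydata <- tibble(
    genre = rep(c("Emo", "Hiphop", "hardstyle"), each = 100),
    danceability = rnorm(300, mean = rep(c(0.49, 0.70, 0.48), each = 100), sd = 0.1),
    energy = rnorm(300, mean = rep(c(0.76, 0.65, 0.90), each = 100), sd = 0.1)
  )
}
songs <- mydata |> mutate(genre = as_factor(genre))

# Split into training and testing sets, stratified by genre (proportion is an assumption)
songs_split <- initial_split(songs, prop = 0.75, strata = genre)
songs_train <- training(songs_split)
songs_test <- testing(songs_split)

# Standardise the predictors so both contribute equally to the distance
songs_recipe <- recipe(genre ~ danceability + energy, data = songs_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification with K left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation on the training set over a candidate grid of K values
songs_vfold <- vfold_cv(songs_train, v = 5, strata = genre)
k_grid <- tibble(neighbors = seq(from = 1, to = 51, by = 5))

cv_results <- workflow() |>
  add_recipe(songs_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = songs_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the K with the highest mean cross-validation accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K and evaluate on the testing set
knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

songs_fit <- workflow() |>
  add_recipe(songs_recipe) |>
  add_model(knn_final) |>
  fit(data = songs_train)

test_accuracy <- predict(songs_fit, songs_test) |>
  bind_cols(songs_test) |>
  metrics(truth = genre, estimate = .pred_class) |>
  filter(.metric == "accuracy")
test_accuracy
```

Predictions for the songs set aside at the start would then come from calling `predict()` on `songs_fit` with those rows, and the predicted points could be added to the scatterplot as an extra layer.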
Expected outcomes and significance
What do you expect to find?
- Associations between genre and either energy or danceability. For example, we might find that Hiphop songs have higher danceability scores while hardstyle songs have lower ones.
- Associations between genre and the combination of energy and danceability. For example, based on the genre means in our preliminary summary, we expect songs with relatively high danceability and lower energy to be more likely Hiphop, and songs with low danceability and very high energy to be more likely hardstyle, while Emo songs combine low danceability with intermediate energy.
What impact could such findings have?
- These findings can improve the music that is recommended to users in music apps. By exploring the user's preference for danceability and energy in music, the app can better recommend more personalized music based on these trends.
- These findings can help users find music for different occasions. Users can use this information to select the appropriate music or genre for different occasions.
What future questions could this lead to?
- This project can lead to thinking about how to build a more accurate genre-prediction model. We could incorporate more of the musical features that influence genre to make a more comprehensive and accurate model.
- This project can also raise questions about other trends in these variables, such as how genres have evolved over time in terms of danceability and energy.